Abstract

The paper introduces a framework for representation and acquisition of knowledge
emerging from large samples of textual data. We utilise a tensor-based, distributional
representation of simple statements extracted from text, and show how one can use the
representation to infer emergent knowledge patterns from the textual data in an
unsupervised manner. Examples of the patterns we investigate in the paper are implicit
term relationships or conjunctive IF-THEN rules. To evaluate the practical relevance of
our approach, we apply it to annotation of life science articles with terms from MeSH
(a controlled biomedical vocabulary and thesaurus).



1 Introduction 

The ubiquity of methods for digital content publishing, processing and sharing has led
to a large amount of data being made globally available every day. Such an unprecedented
worldwide availability of content is generally beneficial, yet it also poses big
challenges. For instance, in domains as dynamic and voluminous as the life sciences, it
is virtually impossible for users to utilise all the available relevant knowledge in a
comprehensive and timely manner [17].

Mitigation of this problem (with a special focus on biomedical literature) has served
as the main motivation for the research presented in this paper. As can be seen for
instance in [9], a popular way of tackling the information overload in the context of
biomedical literature is annotation of articles by terms from standardised biomedical
vocabularies. Such annotations can in turn make the retrieval of relevant documents
much more efficient. However, as providing the necessary annotations manually is
very expensive, automated methods are desired [9], which is what we address here.

The technical contribution of the presented work is two-fold. Firstly, we introduce
a general framework for automated acquisition of knowledge from textual collections.
The proposed framework builds on the principles of distributional [7] and emergent [5]
semantics, and allows for inference of complex knowledge patterns within simple
co-occurrence statements extracted from articles. As a second contribution, we show how
the knowledge inferred from the text can be applied to unsupervised and parameter-free
annotation of biomedical articles.

The rest of the paper is organised as follows. Section 2 gives an overview of related
work. The framework for emergent knowledge representation and acquisition is
described in Section 3. The application of the framework to document annotation is
detailed in Section 4, where we also discuss an experiment we performed to evaluate
our approach. Section 5 concludes the paper and outlines our future work.

2 Related Work 

Our approach builds on and shares many similarities with recent works in emergent [5]
and distributional [3] semantics. However, [5] is quite restrictive and applies
the notion of emergence [4] merely to complex patterns arising from simple interactions
of autonomous agents in distributed systems like P2P networks. We are more
general, focusing rather on inference and analysis of complex patterns emerging within
large amounts of simple statements extracted directly from data. This is in accordance
with a recent approach to distributional semantics presented in [3]. We employ
similar tensor-based structures for representation of data and analysis of the knowledge
emerging from them. Yet we also augment the work of [3] by an explicit representation
of data provenance and, more importantly, by a method for mining rules out of the
distributional representations. The latter is related to the association rule mining
introduced in [1]; however, we generalise the state of the art method to make use of our
distributional (essentially vector-based) representation of the data.

Regarding the application of our framework to annotation of biomedical articles,
a body of more or less recent works like [10], [2], [13], [15] or [9] exists (the second,
third and fifth of the approaches are either used or considered for use as a support
service for the professional annotators of the articles on PubMed, a biomedical
literature repository). The state of the art methods, however, often require at least an
indirect input from human users before they can produce annotations of new articles
automatically. For instance, [2] and [15] require a large corpus of previously annotated
articles for learning and ranking possible annotations of new resources. Other methods
like [13] require rather sophisticated tuning (e.g., experimenting with parameter
settings or with the processing pipeline composition) for optimum performance on new
data. This is not the case for our approach, as it can work off-the-shelf in a purely
unsupervised manner.

3 Distributional Framework for Emergent Knowledge 
Acquisition 

This section first describes how one can represent the knowledge emerging from textual
documents at various levels of complexity: (1) simple term co-occurrence statements
within the documents; (2) an integral view on the statements across the document
corpus; (3) different perspectives of the corpus-wide view for analysing various types of
emergent semantic phenomena. All levels of the representation are based on compact
tensor structures (tensor is a generalisation of the scalar, vector and matrix notions;
see http://en.wikipedia.org/wiki/Tensor for a more detailed overview).
The rest of the section deals with analysis of two particular types of emergent semantic
phenomena that are relevant to document annotation, our motivating use case.



3.1 Source Representation 

The first layer consists of a so-called source representation $G$, which captures the
co-occurrence of terms across a set of documents (a concrete way of extracting
co-occurrence relationships is presented in Section 4). Let $A_l$, $A_r$ be sets
representing left and right arguments of binary co-occurrence relationships (i.e.,
statements), and $L$ the types of the relationships. Furthermore, let $P$ be a set
representing provenances of particular relationships (i.e., document identifiers). We
define the source representation as a 4-ary labeled tensor
$G \in \mathbb{R}^{|A_l| \times |L| \times |A_r| \times |P|}$. It is a four-dimensional
array structure indexed by argument-link-argument-provenance tuples, with values
reflecting the weight (e.g., frequency) of statements in the context of particular sources
(0 if a statement does not occur in a source). For instance, if a statement (protein,
different from, gene) occurs two times in a source $d_x$, then the element
$g_{\text{protein},\,\text{different from},\,\text{gene},\,d_x}$ of $G$ will be 2. More
details are given in Example 1.



Example 1 Let us consider documents $d_1$, $d_2$, $d_3$, $d_4$ and the following terms
occurring in them: protein domain, protein, domain, gene, internal tandem duplications,
mutations, juxtamembrane, extracellular domains (abbreviated as p.d., p., d., g., i.t.d.,
m., j., e.d., respectively, in the following). Let us further assume that the following
statements were extracted from the documents: $d_1$ : { (p.d., D, p.), (p.d., T, d.) },
$d_2$ : { (g., D, p.) }, $d_3$ : { (i.t.d., T, m.), (i.t.d., I, j.), (i.t.d., I, e.d.) },
$d_4$ : { (p.d., D, p.) }, where D, T, I are abbreviations for the relation terms
different from, type of, in. When omitting all zero values and representing a
four-dimensional tensor as a two-dimensional table where the first four columns are for
the tensor indices and the fifth one is for the corresponding tensor value, we can
represent the source with the above statements as follows (using statement frequencies
as values):
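
The source tensor of Example 1 can also be sketched in code. The snippet below is only an illustration: the sparse dictionary layout (index tuple mapped to frequency) and the function name `build_source_tensor` are assumptions of this sketch, not part of the framework itself. The statements are copied from the example as listed above.

```python
from collections import Counter

# Statements per document, as listed in Example 1
# (D = different from, T = type of, I = in).
documents = {
    "d1": [("p.d.", "D", "p."), ("p.d.", "T", "d.")],
    "d2": [("g.", "D", "p.")],
    "d3": [("i.t.d.", "T", "m."), ("i.t.d.", "I", "j."), ("i.t.d.", "I", "e.d.")],
    "d4": [("p.d.", "D", "p.")],
}

def build_source_tensor(docs):
    """Map (s, p, o, d) index tuples to statement frequencies; zeros are omitted."""
    source = Counter()
    for doc_id, statements in docs.items():
        for s, p, o in statements:
            source[(s, p, o, doc_id)] += 1
    return dict(source)

G = build_source_tensor(documents)
for index, value in sorted(G.items()):
    print(index, value)
```

Each printed line corresponds to one row of the two-dimensional table view: four index columns followed by the value.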



3.2 Corpus Representation 

The source tensor is a low-level data representation merely preserving the association
of statements with their provenance contexts. Before allowing for actual distributional
analysis, the data have to be transformed into a more compact structure $C$ we call
corpus representation. $C \in \mathbb{R}^{|A_l| \times |L| \times |A_r|}$ is a ternary
(three-dimensional) labeled tensor providing for a universal and compact distributional
representation of simple statements extracted from source documents. A corpus $C$ can be
constructed from a source representation $G$ using functions
$a : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, $w : P \to \mathbb{R}$,
$f : A_l \times L \times A_r \to \mathbb{R}$. For each element $c_{s,p,o}$ of $C$,
$c_{s,p,o} = a(\sum_{d \in P} w(d)\, g_{s,p,o,d},\; f(s,p,o))$, where $g_{s,p,o,d}$
is an element of the source tensor $G$ and the $a$, $f$, $w$ functions act as follows:
(1) $w$ assigns a relevance degree to each document $d \in P$; (2) $f$ reflects the
relevance of the statement elements (e.g., mutual information score of the subject and
object within the source); (3) $a$ aggregates the results of applying the $w$, $f$
functions. This way of constructing the elements of the corpus tensor from the source
representation aggregates the occurrences of statements within the input data, while
also reflecting two important things: the relevance of particular sources (via the $w$
function), and the relevance of the statements themselves (via the $f$ function). The
specific implementation of the functions is left to applications - alternatives include
(but are not limited to) ranking (both at the statement and document level) or
statistical analysis of the statements within the input data.
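
The construction of the corpus tensor can be sketched as follows. This is a minimal illustration under stated assumptions: the sparse dictionary layout and the name `build_corpus_tensor` are hypothetical, and the concrete choice of $w$, $f$ and $a$ below (unit document weights, a constant $f$ equal to one over the total statement count, and multiplication as $a$) is just one way of obtaining relative frequencies.

```python
from collections import defaultdict
from fractions import Fraction

# Sparse source tensor: (s, p, o, d) -> frequency, taken from Example 1.
G = {
    ("p.d.", "D", "p.", "d1"): 1, ("p.d.", "T", "d.", "d1"): 1,
    ("g.", "D", "p.", "d2"): 1,
    ("i.t.d.", "T", "m.", "d3"): 1, ("i.t.d.", "I", "j.", "d3"): 1,
    ("i.t.d.", "I", "e.d.", "d3"): 1,
    ("p.d.", "D", "p.", "d4"): 1,
}

def build_corpus_tensor(source, w, f, a):
    """c[s,p,o] = a(sum_d w(d) * g[s,p,o,d], f(s,p,o))."""
    weighted = defaultdict(int)
    for (s, p, o, d), g in source.items():
        weighted[(s, p, o)] += w(d) * g
    return {spo: a(total, f(*spo)) for spo, total in weighted.items()}

n_statements = sum(G.values())  # 7 statements in total
C = build_corpus_tensor(
    G,
    w=lambda d: 1,                               # all sources equally relevant
    f=lambda s, p, o: Fraction(1, n_statements), # assumed: constant normaliser
    a=lambda total, relevance: total * relevance,
)
```

Exact fractions are used only to keep the toy values readable; real applications would plug in their own weighting functions.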

Example 2 A corpus corresponding to the source tensor from Example 1 can be represented
as depicted below. The $w$ values were 1 for all sources and $a$, $f$ aggregated the
source values using relative frequency (in a data set containing 7 statements in total).

3.3 Corpus Perspectives

The elegance of the corpus representation lies in its compactness and universality,
which nevertheless allow for many diverse ways of analysing the underlying data. The
analyses are enabled by the process of so-called matricisation of the corpus tensor $C$.
Essentially, matricisation is a process of representing a higher-order tensor using a
2-dimensional matrix perspective. This is done by fixing one tensor index as one matrix
dimension and generating all possible combinations of the other tensor indices within
the remaining matrix dimension. In the following we illustrate the process on the
corpus tensor from Example 2.

Example 3 When fixing the subjects ($A_l$ set members) of the corpus tensor from
Example 2, one will get the following matricised perspective (the rows and columns with
zero values are omitted):

The row and column index abbreviations correspond to Example 1. One can see that the
transformation is lossless, as the original tensor can be easily reconstructed from the
matrix by appropriate re-grouping of the indices.
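
Matricisation and its losslessness can be illustrated with a short sketch. The nested dictionary layout and the helper name `matricise` are assumptions made for illustration; the corpus values are relative frequencies derived from the statements as listed in Example 1.

```python
from collections import defaultdict
from fractions import Fraction

seventh = Fraction(1, 7)
# Sparse corpus tensor: (s, p, o) -> weight (relative frequencies, assumed).
C = {
    ("p.d.", "D", "p."): 2 * seventh,
    ("p.d.", "T", "d."): seventh,
    ("g.", "D", "p."): seventh,
    ("i.t.d.", "T", "m."): seventh,
    ("i.t.d.", "I", "j."): seventh,
    ("i.t.d.", "I", "e.d."): seventh,
}

def matricise(corpus, row_of, col_of):
    """Fix one index (or index pair) as rows, the remaining ones as columns."""
    matrix = defaultdict(dict)
    for index, value in corpus.items():
        matrix[row_of(index)][col_of(index)] = value
    return dict(matrix)

# s/(p,o): subjects as rows, (predicate, object) pairs as columns.
s_po = matricise(C, lambda i: i[0], lambda i: (i[1], i[2]))
# (p,o)/s: the transposed perspective.
po_s = matricise(C, lambda i: (i[1], i[2]), lambda i: i[0])

# Re-grouping the indices recovers the original tensor.
rebuilt = {(s, p, o): v for s, cols in s_po.items() for (p, o), v in cols.items()}
```

Comparing `rebuilt` with `C` confirms that no information is lost by the change of perspective.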

The corpus tensor matricisations correspond to vector spaces consisting of elements 
defined by particular rows of the matrix perspectives. Each row vector has a name (the 
corresponding matrix row index) and a set of features (the matrix column indices). The 
features represent the distributional attributes of the entity associated with the vector's 
name - the contexts aggregated across the whole corpus. This can be used for various 
types of analysis and for inference of more complex semantic features emerging within 
the simple statements extracted from the source data. In the following sections, we 
describe two particular types of analysis that are relevant to the motivating use case 
of this paper: (1) computation of related (semantically close) terms; (2) mining of 
conjunctive IF-THEN rules from the data. 

3.4 Computing Related Terms 

By comparing the row vectors in corpus tensor matricisations, one essentially compares
the meaning of the corresponding label terms, as it emerges from the underlying
data. For exploring the matricised perspectives, one can use linear algebra methods that
have been proven to work by countless successful applications to vector space analysis
in the last couple of decades [16, 6, 12]. Large feature spaces can be reliably reduced
to a more manageable and less noisy number of dimensions by techniques like singular
value decomposition or random indexing (see
http://en.wikipedia.org/wiki/Dimension_reduction). After the (optional) dimensionality
reduction, the perspective vectors can be compared in a well-founded manner by measures
like cosine similarity (see http://en.wikipedia.org/wiki/Cosine_), as illustrated in
Example 4.

Example 4 Let us add one more matrix perspective to the s/(p, o) one provided in
Example 3. It represents the distributional features of right arguments (based on the
contexts of relation terms and left arguments they tend to co-occur with in the corpus):

The vector spaces induced by the matrix perspectives s/(p, o) and o/(p, s) can be used
for finding similar terms by comparing their corresponding vectors. Using the cosine
vector similarity (with $\mathbf{v}_t$ denoting the row vector of a term $t$), one finds
that $sim_{s/(p,o)}(p.d., g.) = \frac{\mathbf{v}_{p.d.} \cdot \mathbf{v}_{g.}}{\sqrt{(1/7)^2 + (2/7)^2}\,\sqrt{(1/7)^2}} \approx 0.2972$
and $sim_{o/(p,s)}(j., e.d.) = \frac{(1/7)(1/7)}{\sqrt{(1/7)^2}\,\sqrt{(1/7)^2}} = 1$.
These are the only non-zero similarities among the terms present in the corpus. This
corresponds to the intuitive interpretation of the data represented by the initial
statements from Example 1. Protein domains and genes seem to be different from proteins,
yet protein domain is a type of domain and gene is not, therefore they share some
similarities but are not completely equal according to the data. Juxtamembranes and
extracellular domains are both places where internal tandem duplications can occur, and
no other information is available, so they can be deemed equal (until more data comes).
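
The cosine comparison of perspective vectors can be sketched as below. The sparse row layout and the function name `cosine` are assumptions of this illustration; the three rows are the o/(p, s) vectors of j., e.d. and m. implied by the statements of Example 1, each with a single feature of weight 1/7.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(weight ** 2 for weight in u.values()))
    norm_v = math.sqrt(sum(weight ** 2 for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows of the o/(p,s) perspective: features are (relation, left argument) pairs.
o_ps = {
    "j.": {("I", "i.t.d."): 1 / 7},
    "e.d.": {("I", "i.t.d."): 1 / 7},
    "m.": {("T", "i.t.d."): 1 / 7},
}

sim_j_ed = cosine(o_ps["j."], o_ps["e.d."])  # identical contexts
sim_j_m = cosine(o_ps["j."], o_ps["m."])     # no shared context
```

Terms that share all their contexts come out as maximally similar, while terms with disjoint contexts get a similarity of zero.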

It can be easily seen how the computation of related terms is relevant to the anno- 
tation use case that has motivated the paper. By computing MeSH terms related to the 
content of an article (i.e., terms that have been extracted from it), one can get annota- 
tions that are semantically related to the article even if they are not present in it and/or 
linked to it in any explicit way. 

3.5 Rule Mining 

Another type of emergent semantic pattern we can infer from the matricised corpus
perspectives are IF-THEN rules. Rules are useful for our motivating use case due to
their applicability to extension of the basic article annotations - once we know that an
article has annotations that conform to a rule's antecedent, we can also add annotations
present in the rule consequent.

To simplify the presentation, let us consider conjunctive IF-THEN rules of type
$(?x, l_1, r_1) \wedge (?x, l_2, r_2) \wedge \cdots \wedge (?x, l_k, r_k) \to (?x, l_{k+1}, r_{k+1}) \wedge \cdots \wedge (?x, l_n, r_n)$
in the following, where $?x$ is a variable and $l_i$, $r_i$, $i \in \{1, \ldots, n\}$
are concrete relation and (right) argument terms. An example of such a rule is
(?x, type of, domain) $\to$ (?x, different from, protein), which says that everything
that is a type of domain is not a protein. The rule mining consists of two steps:
(1) using the matrix perspective (p, o)/s for finding candidate sets of $(l_i, r_i)$
tuples that can form rules; (2) using the matrix perspective s/(p, o) for pruning the
generated rules based on their confidence. Note that other types of single-variable
conjunctive IF-THEN rules (i.e., the ones with the variable occurring in the second or
third position of the rule statements) can be computed in the same way, only using
different perspectives.

The first step corresponds to finding all frequent itemsets in a database as described
in [1]. The row vectors of the (p, o)/s matrix are essentially the 'items' - features of
the rules, i.e., the concrete $(l_i, r_i)$, $i \in \{1, \ldots, n\}$ tuples. By grouping
close vectors, we can discover related features that may possibly form rules. Perhaps
the simplest way of doing this is k-means clustering based on Euclidean distance [8]
applied to the (p, o)/s matrix. The k parameter is set so that the sizes of the generated
clusters correspond to the desired maximum number of statements present in a rule. In
practice, we recommend applying dimensionality reduction to the columns of the matrix.
This makes the clustering faster, while also leading to noise reduction and better
representation of the features' meaning in the sense of [3]. The described approach
effectively replaces the process of finding frequent itemsets in [1]. Using our
distributional representation, we find promising 'itemsets' not via support in discrete
data transactions, but by exploiting their continuous latent semantics.
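
The clustering step can be illustrated with a minimal sketch. It assumes a plain Lloyd-style k-means with caller-supplied initial centroids (no random restarts, no dimensionality reduction); the function name `kmeans` is hypothetical. The five row vectors are read off the distance expressions used in Example 5 and stand for the features (D, p.), (T, d.), (T, m.), (I, j.) and (I, e.d.), in that order.

```python
import math

def kmeans(rows, initial_centroids, max_iter=100):
    """Assign each row to its nearest centroid until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Rows of the (p,o)/s 'training' matrix (values assumed from Example 5).
rows = [
    (1 / 7, 1 / 7, 0.0),  # (D, p.)
    (2 / 7, 0.0, 0.0),    # (T, d.)
    (0.0, 0.0, 1 / 7),    # (T, m.)
    (0.0, 0.0, 1 / 7),    # (I, j.)
    (0.0, 0.0, 1 / 7),    # (I, e.d.)
]
labels = kmeans(rows, initial_centroids=[rows[0], rows[2]])
```

With these inputs the first two features end up in one cluster and the last three in another, which matches the grouping discussed in Example 5.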

The second step involves pruning of the previously generated rules using measures
of support (supp) and confidence (conf). Only rules with sufficiently high confidence
are kept as a result of the mining process. The measures are computed on a matrix that
is a transpose of the one used for generating the rules (s/(p, o) in case of the
discussed type of rules). We keep the original dimensions of the matrix this time, so
that we can check the confidence of the rules using the actual data without any
transformations.

We base the rule pruning on the definitions of support and confidence provided
in [1]; however, we generalise the support so that we can fully exploit the power of
our distributional representation. The classic definition of supp(X) for an itemset (set
of features to form rule statements) is the relative frequency of rows in the data that
contain the items in X. This is due to the fact that the data representation in classical
rule mining is crisp - the rows (transactions) contain only zeros and ones that indicate
the lack and presence of an item in a transaction, respectively. Our data representation
is more general - zeros in the matrix still mean lack of an item in the given row;
however, the actual presence of items is represented in a more fluid way by real-valued
weights. Therefore we define the generalised support as a function
$supp : 2^F \to \mathbb{R}$, where $F$ is a set of rule features (i.e., the (p, o)
column labels of the corpus perspective matrix on which the rules are being tested -
s/(p, o) for the type of rules discussed above). The support of a feature set $X$ on a
perspective matrix $M$ is computed as

$$supp(X) = \frac{1}{\|M\|} \sum_{i \in I_X} \sqrt{\frac{\sum_{j \in X} m_{i,j}^2}{|X|}}.$$

$I_X$ is a set of all row indices of the matrix $M$ where all the features from $X$ are
present (i.e., have a non-zero value), and $m_{i,j}$ is an element of the matrix $M$
with indices $i, j$. $\|M\|$ is a matrix norm (i.e., 'size') defined as

$$\|M\| = \sum_{i \in I} \frac{\sum_{j \in J} m_{i,j}}{|\{k \in J \mid m_{i,k} \neq 0\}|},$$

where $I$, $J$ are sets of row and column indices of $M$, respectively. The confidence of
a rule $X \to Y$ is then computed as defined in [1], i.e.,
$conf(X \to Y) = \frac{supp(X \cup Y)}{supp(X)}$, only using the generalised support.
The process of rule mining is further illustrated in Example 5 at the end of this
section.

The proposed definition of support essentially computes the weighted relative frequency
of the input feature set $X$ in the matrix rows. Only rows that contain all features
contribute to the absolute frequency count. The actual contribution is computed as a
normalised Euclidean size of the row vector restricted only to the column indices from
$X$. The normalising factor is the size of the feature set (this is to make the support
value independent of the size of $X$). The absolute weighted frequency of the feature
set is then divided by $\|M\|$ to get the relative frequency (analogously to the
classical definition of support). $\|M\|$ reflects the size of the real-valued data set
as a sum of all weights in the matrix $M$, normalised by the number of non-zero elements
in each row (this also makes the norm independent of the size of all potential feature
sets). One can easily check that if the values in the matrix $M$ are just zeros and ones
as in the traditional data representation used by [1], our support becomes the classical
one.
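
The generalised support and confidence can be sketched in a few lines. The code below is one reading of the definitions in this section, not a reference implementation: the sparse row layout and the function names are assumptions, and the contribution of a row is taken as the square root of the mean squared weight over the features in X.

```python
import math

def matrix_norm(M):
    """Sum over rows of (sum of row weights / number of non-zero row entries)."""
    total = 0.0
    for row in M.values():
        nonzero = [v for v in row.values() if v != 0]
        if nonzero:
            total += sum(nonzero) / len(nonzero)
    return total

def support(M, X):
    """Generalised support of a non-empty feature set X on the matrix M."""
    absolute = 0.0
    for row in M.values():
        if all(row.get(feature, 0) != 0 for feature in X):
            absolute += math.sqrt(sum(row[feature] ** 2 for feature in X) / len(X))
    return absolute / matrix_norm(M)

def confidence(M, X, Y):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(M, set(X) | set(Y)) / support(M, X)

# Crisp (0/1) transactions: the generalised support reduces to the classical one.
crisp = {
    "t1": {"a": 1, "b": 1},
    "t2": {"a": 1},
    "t3": {"b": 1, "c": 1},
    "t4": {"a": 1, "b": 1, "c": 1},
}

# Real-valued s/(p,o) perspective (relative frequencies, assumed from Example 1).
s_po = {
    "p.d.": {("D", "p."): 2 / 7, ("T", "d."): 1 / 7},
    "g.": {("D", "p."): 1 / 7},
    "i.t.d.": {("T", "m."): 1 / 7, ("I", "j."): 1 / 7, ("I", "e.d."): 1 / 7},
}
```

On the crisp matrix, `support(crisp, {"a"})` is 3/4 and `confidence(crisp, {"a"}, {"b"})` is 2/3, as in classical rule mining. The norm of the real-valued matrix comes out as 0.5 (up to floating-point error), which agrees with the matrix size quoted in Example 5.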

Example 5 Building on the previous examples, the 'training' matrix for the rule mining
is a transpose of the one given in Example 3.

No dimension reduction is applied due to simplicity of the example. The testing matrix
is the original one, i.e., s/(p, o) from Example 3. The Euclidean distance between any
two of the last three vectors in the 'training' matrix (p, o)/s is 0. The distance
between the first two vectors is
$d_{1,2} = \sqrt{(1/7 - 2/7)^2 + (1/7 - 0)^2} = \sqrt{2}/7$. The distances between the
first and second and any of the last three vectors are
$d_{1,3-5} = \sqrt{(1/7 - 0)^2 + (1/7 - 0)^2 + (0 - 1/7)^2} = \sqrt{3}/7$ and
$d_{2,3-5} = \sqrt{(2/7 - 0)^2 + (0 - 1/7)^2} = \sqrt{5}/7$, respectively. The
minimum-distance grouping of the vectors into clusters containing at least two elements
is thus as follows: $G_1 : \{(D, p.), (T, d.)\}$,
$G_2 : \{(T, m.), (I, j.), (I, e.d.)\}$.

Let us abbreviate the rule statements corresponding to the 'training' matrix above as
follows: $s_1 : (?x, D, p.)$, $s_2 : (?x, T, d.)$, $s_3 : (?x, T, m.)$,
$s_4 : (?x, I, j.)$, $s_5 : (?x, I, e.d.)$. Then the groups $G_1$, $G_2$ generate these
14 rules: $R_1$-$R_2 : s_i \to s_j$, $i, j \in \{1, 2\}$;
$R_3$-$R_5 : s_i \to s_j \wedge s_k$, $i, j, k \in \{3, 4, 5\}$;
$R_6$-$R_8 : s_i \wedge s_j \to s_k$, $i, j, k \in \{3, 4, 5\}$;
$R_9$-$R_{14} : s_i \to s_j$, $i, j \in \{3, 4, 5\}$.
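
The enumeration of candidate rules from the clusters can be sketched as follows. The sketch assumes that a candidate rule is any pair of disjoint, non-empty antecedent and consequent sets drawn from a single cluster; the function name `candidate_rules` is hypothetical.

```python
from itertools import combinations

def candidate_rules(cluster):
    """All (antecedent, consequent) pairs of disjoint non-empty subsets of a cluster."""
    items = sorted(cluster)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            rest = [item for item in items if item not in antecedent]
            for consequent_size in range(1, len(rest) + 1):
                for consequent in combinations(rest, consequent_size):
                    rules.append((frozenset(antecedent), frozenset(consequent)))
    return rules

g1 = {"s1", "s2"}
g2 = {"s3", "s4", "s5"}
rules = candidate_rules(g1) + candidate_rules(g2)
```

For the two groups of Example 5 this yields 2 rules from the first cluster and 12 from the second, i.e., the 14 rules listed above.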

The 'testing' matrix is s/(p, o) (see Example 3) and its size is 0.5. Corresponding supports
of the relevant sets of rule statements are: supp({si}) = supp({s2}) = |, supp({ss}) = 
supp{{s 4 }) = supp({s 5 }) = j,supp({s 1 ,s 2 }) = ^,supp({s 3 ,s 4 }) = supp({s3,s 5 }) = 
supp({.34, S5}) = supp({s3, S4, S5}) = 2^1. Thus the confidences of the rules are: 
conflRx) = conf(R 2 ) = ^,conf(R3) = conf(Ri) = conf(R 5 ) = ^-,conf(Re) = 
conf(R 7 ) = conf(Rs) = ^-,conf(R 9 ) = conf(R w ) = •■• = conf{R 14 ) = When 
setting the confidence threshold to 0.5, the rules R1—R5 are discarded. 

4 Automated Document Annotation 

This section illustrates the practical potential of the general framework introduced so 
far. First we describe its application to unsupervised annotation of biomedical articles 
with terms from the MeSH thesaurus. Then we present the evaluation of our approach 
and discuss the results obtained. 

4.1 Data and Method 

As a corpus of documents for annotation, we employed 2,003 articles from the PubMed
repository (http://www.ncbi.nlm.nih.gov/pubmed/) that had their fulltexts
available from PubMed Central (http://www.ncbi.nlm.nih.gov/pmc/). The
articles were selected so that for each article present, the corpus also contained
corresponding related articles as offered by the PubMed's related articles service [11].
This fact was important for the evaluation later on. For the article annotation, we used
the MeSH 2011 version (obtained at http://www.nlm.nih.gov/mesh/filelist.html).

We processed the data using the following high-level pipeline: (1) extraction of
statements from the articles and from MeSH; (2) incorporation of the extracted
statements into two separate knowledge bases for PubMed articles and for the MeSH
thesaurus; (3) construction of basic MeSH annotation sets for each article; (4) mining
of rules from the MeSH knowledge base; (5) rule-based extension of the basic annotation
sets; (6) evaluation of the initial and extended sets of annotations.

In the extraction step, we focused on simple binary co-occurrence statements.
We tokenized the article text into sentences, then applied part of speech tagging and
shallow parsing in order to determine noun phrases. Any two noun phrases $NP_1$, $NP_2$
occurring in the same sentence formed a statement $(NP_1, R, NP_2)$, where $R$ stands
(here and in the following) for a related_to relationship expressing a general
relatedness between the left and right arguments. Any synonyms of MeSH terms in the
statements were converted to the corresponding preferred MeSH headings in order to
lexically unify the data. 1,379,235 statements were generated from the 2,003 articles
this way. From the MeSH data set, we generated $(T_1, R, T_2)$ statements for all terms
(i.e., headings) $T_1$, $T_2$ such that they were parent, child or sibling of each other
in the MeSH hierarchy, which led to 41,632 statements. Note that for both data sets, we
considered the $R$ relation symmetric, which effectively made the s/(p, o) and o/(p, s)
perspectives equivalent in the subsequent steps.
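
The statement generation from noun phrases can be sketched as follows. The sketch assumes that sentence splitting, tagging and shallow parsing have already produced one list of noun phrases per sentence; the function name `cooccurrence_statements`, the de-duplication of phrases within a sentence, and the choice to materialise the symmetric relation in both directions are all assumptions of this illustration.

```python
from itertools import permutations

def cooccurrence_statements(noun_phrases_per_sentence, relation="R"):
    """Emit (NP1, R, NP2) for every ordered pair of distinct NPs in a sentence."""
    statements = []
    for noun_phrases in noun_phrases_per_sentence:
        unique = sorted(set(noun_phrases))
        statements.extend((left, relation, right)
                          for left, right in permutations(unique, 2))
    return statements

# Hypothetical output of the shallow parser for two sentences.
parsed = [
    ["protein domain", "gene", "mutation"],
    ["gene", "gene"],
]
statements = cooccurrence_statements(parsed)
```

A sentence with three distinct noun phrases yields six statements (both directions of each pair), while a sentence mentioning a single phrase yields none.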

The adopted model of co-occurrence limited to a single general relationship $R$ may
seem restrictive; however, we chose it in order to link the semantics of
the data extracted from articles with the semantics of MeSH in the most general sense
applicable. Apart from that, [14] suggest that in settings similar to ours, such
'flattened' semantics actually perform better than a model with multiple relations.

The second step in the experimental pipeline was incorporation of the extracted
statements into knowledge bases (i.e., the source, corpus and perspective structures
described in Section 3). The incorporation was done in the same way for both PubMed
and MeSH data. The source ($G$) values were set to 1 for all elements $g_{s,p,o,d}$ such
that the statement $(s, p, o)$ occurred in the document $d$; all other values were 0. To
get the corpus ($C$) tensor values $c_{s,p,o}$, we multiplied the frequency of the
$(s, p, o)$ triples (i.e., $\sum_{d \in P} g_{s,p,o,d}$) by the point-wise mutual
information score of the $(s, o)$ tuple (see
http://en.wikipedia.org/wiki/Pointwise_mutual_information).
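
The frequency-times-PMI weighting can be sketched as below. How the probabilities were estimated is not stated here, so the sketch makes its own assumptions: probabilities are relative statement frequencies over the whole knowledge base, and the logarithm is base 2. The function name `pmi_weighted_corpus` is hypothetical.

```python
import math
from collections import Counter

def pmi_weighted_corpus(source):
    """c[s,p,o] = freq(s,p,o) * PMI(s,o), with PMI from relative frequencies."""
    freq = Counter()
    for (s, p, o, _doc), g in source.items():
        freq[(s, p, o)] += g
    total = sum(freq.values())
    pair, left, right = Counter(), Counter(), Counter()
    for (s, _p, o), count in freq.items():
        pair[(s, o)] += count
        left[s] += count
        right[o] += count
    corpus = {}
    for (s, p, o), count in freq.items():
        # PMI = log2( p(s,o) / (p(s) * p(o)) ), all estimated from counts.
        pmi = math.log2(pair[(s, o)] * total / (left[s] * right[o]))
        corpus[(s, p, o)] = count * pmi
    return corpus

# Toy source tensor with binary values, as used in the experiment.
G = {
    ("a", "R", "b", "d1"): 1,
    ("a", "R", "b", "d2"): 1,
    ("c", "R", "d", "d1"): 1,
}
C = pmi_weighted_corpus(G)
```

Statements whose arguments co-occur more often than expected by chance receive positive weights that grow with their frequency.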

The annotations for each article $d$ were computed using the article knowledge
base as follows. First we constructed a set $TF = \{(t, f_d(t)) \mid t \in d\}$, where
$t$ are all terms extracted from $d$ and $f_d(t)$ is the absolute frequency of the term
$t$ in $d$. For each $(t, f_d(t))$ tuple from $TF$, we computed another set
$REL_t = \{(t', f_d(t) \cdot sim_{s/(p,o)}(t, t')) \mid sim_{s/(p,o)}(t, t') > 0\}$.
Rephrased in prose, the $REL_t$ sets contained tuples of all terms similar to $t$ and the
actual similarities multiplied by the $f_d(t)$ frequency (more frequent terms should
generally produce terms with higher relatedness value). The $sim_{s/(p,o)}$ similarity
function was defined as in Example 4. Eventually, we collated the particular term
relatedness values across the whole document $d$ into an overall relatedness
$rel(t') = \frac{1}{W} \sum_{w \in W_{t'}} w$, where
$W_{t'} = \{r \mid (t', r) \in \bigcup_{t \in d} REL_t\}$ and $W$ is a sum of all the
relatedness values occurring in the $\bigcup_{t \in d} REL_t$ union. The final output of
this step for each document $d$ was a set of all related terms $t'$ such that $t'$ is in
MeSH. The $rel(t')$ values were used for ranking the set of MeSH annotations and taking
only the top ones if necessary.
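
The computation of the basic annotation set can be sketched as follows. The function name `basic_annotations` and the callable `similar_terms` (returning a term-to-similarity mapping for a given term) are hypothetical. As a simplification, the sketch sums all contributions for a related term, whereas the set-based definition above would collapse identical (term, value) pairs.

```python
def basic_annotations(term_freqs, similar_terms, mesh_terms):
    """Rank MeSH terms by frequency-weighted similarity to the article's terms."""
    raw = {}
    for term, freq in term_freqs.items():
        for related, sim in similar_terms(term).items():
            if sim > 0:
                raw[related] = raw.get(related, 0.0) + freq * sim
    total = sum(raw.values())  # W: sum over all relatedness values
    if total == 0:
        return []
    ranked = [(t, value / total) for t, value in raw.items() if t in mesh_terms]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Toy inputs: term frequencies in one article and a similarity lookup.
term_freqs = {"a": 2, "b": 1}
similarities = {"a": {"x": 0.5, "y": 0.25}, "b": {"x": 1.0}}
annotations = basic_annotations(
    term_freqs,
    similar_terms=lambda t: similarities.get(t, {}),
    mesh_terms={"x"},
)
```

Taking only the top annotations then amounts to slicing the returned list.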

The rule mining part of the experimental pipeline was executed iteratively with
different random initialisations of the clusters until no new rules were added in at
least the 10 most recent iterations. We obtained 33,384 rules with confidence of at least
0.5 this way. The rules were then used for extending the basic article annotation sets as
follows. Let us assume an article $d$ has annotations $\{t_1, t_2, \ldots, t_n\}$. Then
for any rule
$(?x, R, e_1) \wedge (?x, R, e_2) \wedge \cdots \wedge (?x, R, e_k) \to (?x, R, e_{k+1}) \wedge (?x, R, e_{k+2}) \wedge \cdots \wedge (?x, R, e_m)$
such that $\{t_1, t_2, \ldots, t_n\} \supseteq \{e_1, e_2, \ldots, e_k\}$, we used the
consequent set $\{e_{k+1}, e_{k+2}, \ldots, e_m\}$ as extended annotations for the
article $d$. The relatedness measure of the extensions $e$ was computed as
$\frac{1}{W} \sum_{c \in C_e} c$, where $C_e$ is a set of confidences of all rules that
contributed with the extension $e$, and $W$ is a sum of all such confidences across all
extensions computed. Similarly to the basic annotation sets, the relatedness of the
extensions was used for their ranking and possible restriction to top-scoring ones.
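
The rule-based extension can be sketched as below. The representation of rules as (antecedent set, consequent set, confidence) triples and the function name `extend_annotations` are assumptions of this illustration; a rule is taken to fire when its whole antecedent is contained in the article's annotation set.

```python
def extend_annotations(annotations, rules):
    """Add rule consequents whose antecedents are covered by the annotations."""
    contributions = {}
    for antecedent, consequent, conf in rules:
        if antecedent <= annotations:  # antecedent is a subset of the annotations
            for extension in consequent:
                contributions.setdefault(extension, []).append(conf)
    total = sum(sum(confs) for confs in contributions.values())  # W
    if total == 0:
        return {}
    return {ext: sum(confs) / total for ext, confs in contributions.items()}

# Toy rules over hypothetical MeSH headings a, b, c, d, q, z.
rules = [
    (frozenset({"a"}), frozenset({"c"}), 0.8),
    (frozenset({"a", "b"}), frozenset({"c", "d"}), 0.6),
    (frozenset({"z"}), frozenset({"q"}), 0.9),
]
extended = extend_annotations({"a", "b"}, rules)
```

In this toy case the first two rules fire, so "c" collects confidences 0.8 and 0.6 while "d" collects 0.6 only; the third rule does not apply.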

Note that the data we have been working with, as well as the library and scripts we 
have implemented for the experiment, are available for reference at |http : / / dl . dropbox . com/u/2 137 922 6/a; 

4.2 Evaluation and Discussion 

To evaluate the annotation sets produced in the experimental pipeline, we used two
methods. Firstly, we measured precision and recall of the basic and extended annotation
sets based on their comparison with manually provided MeSH annotations of the
corresponding articles (available through the PubMed's Entrez API). For each article,
we computed average precision, precision and recall [12] of all computed annotations
and also of the top $h$ ones, where $h$ is the number of human annotations for the given
article.

The second evaluation method focused on the utility of the computed annotations,
namely in the task of finding related articles. We used a standard vector space
model [16] for determining the relatedness of documents, where features were formed
by the sets of computed or manually assigned article annotations. For each document,
we computed different sets of related documents (based on the human annotations and
on the basic/extended ones generated by our framework). To determine their precision
and recall, the computed sets were compared to corresponding sets of related articles
provided by the dedicated PubMed service.¹ Similarly to the evaluation of annotations
themselves, we measured average precision, precision and recall of all and of the top
$h$ related articles computed, where $h$ was the number of related articles in the gold
standard.

The results of the evaluation are summarised in Tables 1 and 2. The mean average
precision (MAP), precision and recall lines in the tables were computed as an arithmetic
mean across the particular values for all 2,003 articles in the experimental corpus. The
F-score ($F_1$ in particular) was computed from the mean precision/recall values. The
columns in the tables correspond to the types of the article annotation sets described
above. BASE, EXT. refer to the basic and extended annotations, while ALL, TOP refer
to complete and top-$h$ only annotation sets. Note that we did not include the EXT/TOP
annotations into the result summaries, since they were performing significantly worse
than the other ones in most of the measured categories.
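
The per-article measures can be sketched as follows. This is a generic illustration of precision, recall, average precision and $F_1$ in their standard textbook form; the function names are hypothetical, and normalising average precision by the size of the gold standard is an assumption about the exact variant used.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against a gold standard."""
    hits = len(set(predicted) & set(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def average_precision(ranked, gold):
    """Mean of precision values at the ranks where gold items are retrieved."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ranked = ["a", "x", "b"]   # computed annotations, best first
gold = {"a", "b", "c"}     # manually assigned annotations
h = len(gold)
p_top, r_top = precision_recall(ranked[:h], gold)
ap = average_precision(ranked, gold)
```

The TOP variants restrict the ranked list to its first $h$ entries before scoring, and the corpus-level figures are arithmetic means of the per-article values.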

The comparison with the manually curated MeSH annotations in Table 1 does not
look particularly impressive, with highest precision and recall values of 16.4% and
12.7%, respectively. On the other hand, the automatically computed annotations
performed much better than the 'manual' ones when used as features for finding
related articles. As can be seen in Table 2, there is a substantial improvement, namely
regarding precision and overall F-score. The only measure where the manually curated
annotations perform slightly (ca. 1.1 times) better than the next-best automated method
is recall. Especially notable is the difference in precision - the extended annotations
achieve more than 91%, which is about twice as good as the human ones.

Table 1: Evaluation results (article annotation). Columns: BASE (ALL | TOP), EXT. (ALL).

Table 2: Evaluation results (annotation utility). Columns: BASE-ALL (ALL | TOP),
BASE-TOP (ALL | TOP), EXT-ALL (ALL | TOP), HUMAN (ALL | TOP).

¹ The service is based on algorithms described in [11]. This is obviously less desirable
than a gold standard designed solely by human experts; however, no such gold standard was
readily available for all PubMed articles we processed and we lacked the manpower to
create it ourselves. In this situation, we considered the state of the art service
currently endorsed by the PubMed staff and millions of users as a reasonable
alternative to a hand-crafted gold standard.

The results we obtained may have several interpretations. We believe that one of the
more plausible ones is related to the nature of the manually provided MeSH annotations.
As mentioned for instance in [13], the goal of PubMed annotators is to provide the
best MeSH 'tags' for the purpose of indexing in digital library collections. Thus they
are motivated to select annotations that better discriminate papers from each other. This
may, however, be rather detrimental when the task is to identify related papers using
the annotations, as features used for identifying relatedness (i.e., similarity) are often
dual to the features used for discrimination of entities [18]. This reasoning can in turn
explain why our automatically computed article annotations, apparently very different
from the manually curated ones, perform significantly better when used as features for
finding related articles. The better performance (especially in case of the precision of
extended annotations) may indicate that the automatically computed annotations are
selected in a more fine-grained manner and from a more varied 'vocabulary' than the
ones provided by human annotators, who can hardly grasp the scale of all the
hypothetically available annotations (in addition to having different motivations as
mentioned before). This is not to say that either kind of annotations is worse than the
other; rather, it means that they simply serve slightly different purposes.

To conclude the discussion, we believe that despite the low performance of our
approach in terms of comparison with manually curated MeSH annotations, we can still
offer potentially very beneficial results (especially in case of annotations augmented
by emergent rules). This holds particularly for use cases where the annotations are
supposed to be produced in a scalable and economical way in order to determine
similarities between articles. Examples of such use cases include not only identification
of related documents, but also question answering or automated linking of publications
and supplementary data (e.g., biomedical data in the RDF format provided at
http://linkedlifedata.com/sources which we can easily incorporate as
implied by OH). 

5 Conclusions and Future Work



We presented an approach to acquisition of complex knowledge patterns emerging 
within simple statements extracted from textual data. The distinctive features of our 
approach are unification of the principles of emergent and distributional semantics, 
and a novel method for mining rules from the proposed distributional representation. 
To demonstrate the practical relevance of our work, we applied it to annotation of 
PubMed articles with terms from the MeSH thesaurus. After discussing our results, we 
identified areas where our approach can likely bring most benefits to users. 

In the future, we will explore more use cases and investigate other types of knowledge
patterns (e.g., emergent formation of new candidate concepts and taxonomical relations
to be recommended for inclusion into the MeSH thesaurus). Regarding the presented
use case, we intend to look into possible combinations of our approach and relevant
state of the art (namely the ranking-based methods like [9] or [15]). This is also related
to a deeper evaluation of our work that would utilise the state of the art approaches as a
baseline (so far we have not been able to do so comprehensively enough due to a lack of
publicly available and applicable implementations). Eventually, we want to perform a
qualitative evaluation of the annotations produced by our system with the assistance of
domain experts.